Abstract

The path-indexing method for indexing first-order predicate calculus terms is a refinement
of the standard coordinate-indexing method. Path indexing offers much faster retrieval at a
modest cost in space. Path indexing is compared with discrimination-net and codeword
indexing. While discrimination-net indexing may often be the preferred method for maximum
speed, path indexing is an effective alternative if discrimination-net indexing requires too
much space or in certain cases in which discrimination-net indexing performs particularly
poorly.

1  Introduction 

Artificial intelligence (AI), logic programming, and automated deduction systems are often
required to deal with large amounts of symbolic information. The need to store large
amounts of information is met in conventional applications by database systems, but the
form of the data in AI, logic programming, and automated deduction applications requires
somewhat different techniques, particularly for indexing the data for effective retrieval.

It is necessary in these applications to retrieve entries that are indexed by values that are
at least as general as first-order predicate calculus terms. A first-order predicate calculus
term is either (1) a constant, (2) a variable, or (3) an n-ary function symbol applied to n
term arguments. Examples are the constant terms John, widget, and 3.1416; the variable
terms x and y; and the composite terms (a + x) - 3, father(John), and append(nil, x, x).
Conventional databases can easily index only constant terms, e.g., numbers and strings.

Besides retrieving exact matches or accepting any value as in conventional database
retrieval operations, it is necessary, for example, to be able to specify retrieval of all stored
terms that are unifiable with x + (-x).

Terms can be related in several ways. A pair of terms can be equal, or equal except for
the names of the variables, i.e., the terms are variants. One term can be an instance of
another: the first is equal to the second with terms substituted for its variables. Conversely,
one term can be a generalization of another. Finally, the terms can be unifiable, i.e., have
a common instance.
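The relations just described can be made concrete with a small sketch. The representation is an assumption of this example, not something the text prescribes: a composite term is a Python tuple `(symbol, arg1, ..., argn)`, a constant is a plain string, and a variable is a string beginning with `?`. The helper names (`match`, `unify`, `is_instance`, `is_variant`) are likewise hypothetical.

```python
def is_var(t):
    # Assumption of this sketch: variables are strings such as "?x".
    return isinstance(t, str) and t.startswith("?")

def walk(t, s):
    # Follow variable bindings in substitution s until an unbound term is reached.
    while is_var(t) and t in s:
        t = s[t]
    return t

def occurs(v, t, s):
    # True if variable v occurs in term t under substitution s.
    t = walk(t, s)
    if t == v:
        return True
    return isinstance(t, tuple) and any(occurs(v, a, s) for a in t[1:])

def unify(t1, t2, s=None):
    """Return a substitution making t1 and t2 equal, or None.
    The two terms are assumed not to share variable names."""
    s = {} if s is None else s
    t1, t2 = walk(t1, s), walk(t2, s)
    if t1 == t2:
        return s
    if is_var(t1):
        return None if occurs(t1, t2, s) else {**s, t1: t2}
    if is_var(t2):
        return None if occurs(t2, t1, s) else {**s, t2: t1}
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and t1[0] == t2[0] and len(t1) == len(t2)):
        for a, b in zip(t1[1:], t2[1:]):
            s = unify(a, b, s)
            if s is None:
                return None
        return s
    return None

def match(general, specific, s=None):
    """One-way matching: only variables of `general` may be bound."""
    s = {} if s is None else s
    if is_var(general):
        if general in s:
            return s if s[general] == specific else None
        return {**s, general: specific}
    if (isinstance(general, tuple) and isinstance(specific, tuple)
            and general[0] == specific[0] and len(general) == len(specific)):
        for g, sp in zip(general[1:], specific[1:]):
            s = match(g, sp, s)
            if s is None:
                return None
        return s
    return s if general == specific else None

def is_instance(t, general):
    return match(general, t) is not None

def is_generalization(t, specific):
    return is_instance(specific, t)

def is_variant(t1, t2):
    # Variants are instances of each other.
    return is_instance(t1, t2) and is_instance(t2, t1)
```

Under these assumptions, `f(?x, ?y)` and `f(?u, ?v)` are variants, while `f(?x, ?x)` is an instance of `f(?u, ?v)` but not a variant of it.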

Retrieval  on  a  field  containing  keys  that  are  first-order  predicate  calculus  terms  may 
require  finding  terms  in  the  database  that  are 

•  Equal  to  the  term  in  the  retrieval  request 

•  Variants  of  the  term  in  the  retrieval  request 

•  Instances  of  the  term  in  the  retrieval  request 

•  Generalizations  of  the  term  in  the  retrieval  request 

• Unifiable with the term in the retrieval request.

The need for each form of retrieval can be illustrated in the field of automated
deduction [1,11]. Automated deduction systems often require all these forms of retrieval. Other
applications, such as logic programming and expert systems, often use only a subset. To
determine if a newly derived formula is already present in the database, it is necessary to
retrieve terms that are equal to or variants of the new formula. It is necessary to find
instances (resp., generalizations) of a newly derived formula to perform equality simplification
or subsumption¹ of formulas in the database by the new formula (resp., of the new formula
by formulas in the database). Other operations, such as resolution, may require unifiable
terms. Prolog inference, as a special case of resolution, requires retrieval of clauses whose
head literal is unifiable with the current goal.

The path-indexing method we propose is a refinement of the coordinate-indexing method
that was proposed for the PLANNER AI programming language [4] and was used in the
Logic Machine Architecture (LMA) [7] for implementing deduction systems (where it is
called FPA indexing). Path indexing offers substantially faster retrieval in exchange for a
modest increase in storage cost.

¹In deduction, subsumption is the deletion of assertions that are instances of more general assertions. It
is done to eliminate redundant information and reduce the size of the search space.


            o
           / \
         1/   \2
         /     \
        o       o
              / | \
            1/ 2|  \3
            /   |   \
           o    o    o
                    / \
                  1/   \2
                  /     \
                 o       o

Figure 1: Tree corresponding to term f(a, g(b, x, h(y, z))).


Section 2 contains a description of the simple and familiar coordinate-indexing method.
Section 3 defines the new path-indexing method, which is easily understood by comparing
it with the coordinate-indexing method. An implementation approach is given in
Section 4. Path indexing is compared with coordinate indexing in Section 5 and compared
with discrimination-net and codeword indexing methods in Section 6.


2  Coordinate  Indexing 

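The term t = f(a, g(b, x, h(y, z))) can be specified by a mapping Symbol_t from the
coordinates (), (1), (2), (2,1), (2,2), (2,3), (2,3,1), (2,3,2) of the tree in Figure 1 to the symbols
in t (variables are written as *; see footnote 2):

$$
\begin{aligned}
\mathit{Symbol}_t(()) &= f \\
\mathit{Symbol}_t((1)) &= a \\
\mathit{Symbol}_t((2)) &= g \\
\mathit{Symbol}_t((2,1)) &= b \\
\mathit{Symbol}_t((2,2)) &= * \\
\mathit{Symbol}_t((2,3)) &= h \\
\mathit{Symbol}_t((2,3,1)) &= * \\
\mathit{Symbol}_t((2,3,2)) &= *
\end{aligned}
$$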
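A minimal sketch of how the Symbol_t mapping for coordinates might be computed. The tuple representation of terms and the `?` prefix for variables are assumptions of this example, and `coordinate_symbols` is a hypothetical helper name.

```python
def coordinate_symbols(term, coord=()):
    """Map each coordinate of `term` to the symbol found there.
    Composite terms are tuples (symbol, arg1, ..., argn); variables are
    strings starting with "?" and are all recorded as "*"."""
    if isinstance(term, tuple):
        mapping = {coord: term[0]}
        for i, arg in enumerate(term[1:], start=1):
            mapping.update(coordinate_symbols(arg, coord + (i,)))
        return mapping
    return {coord: "*" if term.startswith("?") else term}

# t = f(a, g(b, x, h(y, z)))
t = ("f", "a", ("g", "b", "?x", ("h", "?y", "?z")))
for coord, symbol in sorted(coordinate_symbols(t).items()):
    print(coord, "=", symbol)
```

The eight entries produced for t correspond to the eight coordinates listed above.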
Let Terms be the set of terms stored for possible retrieval. Then the following can be
used to define methods for retrieving those terms in Terms that are variants of, instances
of, generalizations of, or unifiable with a term.³ GetVariants((), t) returns a subset of
Terms that includes all variants of t; GetInstances((), t) returns a subset that includes
all instances; GetGeneralizations((), t) returns a subset that includes all generalizations;
GetUnifiables((), t) returns a subset that includes all terms unifiable with t. If t and Terms
are all linear terms, i.e., do not contain repeated variables, then these retrieval operations
are exact. If there are nonlinear terms, extra terms may be retrieved. The expression p · i
denotes the coordinate p extended by i, e.g., (2) · 3 is (2,3).

Let GetTerms(p, s) = {t ∈ Terms | Symbol_t(p) = s}. Retrieval formulas will be
composed of set union and intersection operations applied to GetTerms sets. Methods for
efficiently computing the GetTerms sets and their unions and intersections are described
in Section 4.


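$$
\begin{aligned}
\mathit{GetVariants}(p, x) &= \mathit{GetTerms}(p, *) \\
\mathit{GetVariants}(p, a) &= \mathit{GetTerms}(p, a) \\
\mathit{GetVariants}(p, f(t_1, \ldots, t_n)) &= \mathit{GetTerms}(p, f) \cap {} \\
&\qquad \mathit{GetVariants}(p \cdot 1, t_1) \cap \cdots \cap \mathit{GetVariants}(p \cdot n, t_n)
\end{aligned}
$$

$$
\begin{aligned}
\mathit{GetInstances}(p, x) &= \mathit{Terms} \quad \text{(see footnote 4)} \\
\mathit{GetInstances}(p, a) &= \mathit{GetTerms}(p, a) \\
\mathit{GetInstances}(p, f(t_1, \ldots, t_n)) &= \mathit{GetTerms}(p, f) \cap {} \\
&\qquad \mathit{GetInstances}(p \cdot 1, t_1) \cap \cdots \cap \mathit{GetInstances}(p \cdot n, t_n)
\end{aligned}
$$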


²These indexing methods ignore the identity of variables, so all are replaced by *.

³Equal terms can be retrieved by retrieving all variants and checking them for equality. If there is
sufficient need for equal retrievals, each term can be stored twice, the second time with its variables indexed
as if they were constants. Equal terms can then be found by retrieving variants of the term in the retrieval
request, with variables again treated as if they were constants.

$$
\begin{aligned}
\mathit{GetGeneralizations}(p, x) &= \mathit{GetTerms}(p, *) \\
\mathit{GetGeneralizations}(p, a) &= \mathit{GetTerms}(p, *) \cup \mathit{GetTerms}(p, a) \\
\mathit{GetGeneralizations}(p, f(t_1, \ldots, t_n)) &= \mathit{GetTerms}(p, *) \cup {} \\
&\qquad (\mathit{GetTerms}(p, f) \cap {} \\
&\qquad\ \mathit{GetGeneralizations}(p \cdot 1, t_1) \cap \cdots \cap \mathit{GetGeneralizations}(p \cdot n, t_n))
\end{aligned}
$$

$$
\begin{aligned}
\mathit{GetUnifiables}(p, x) &= \mathit{Terms} \\
\mathit{GetUnifiables}(p, a) &= \mathit{GetTerms}(p, *) \cup \mathit{GetTerms}(p, a) \\
\mathit{GetUnifiables}(p, f(t_1, \ldots, t_n)) &= \mathit{GetTerms}(p, *) \cup {} \\
&\qquad (\mathit{GetTerms}(p, f) \cap {} \\
&\qquad\ \mathit{GetUnifiables}(p \cdot 1, t_1) \cap \cdots \cap \mathit{GetUnifiables}(p \cdot n, t_n))
\end{aligned}
$$

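For example, instances of the term f(a, g(b, x, h(y, z))) can be retrieved by

$$
\begin{aligned}
&\mathit{GetTerms}((), f) \cap {} \\
&\mathit{GetTerms}((1), a) \cap {} \\
&\mathit{GetTerms}((2), g) \cap {} \\
&\mathit{GetTerms}((2,1), b) \cap {} \\
&\mathit{GetTerms}((2,3), h)
\end{aligned}
\tag{1}
$$

and terms unifiable with it can be retrieved by

$$
\begin{aligned}
&\mathit{GetTerms}((), *) \cup {} \\
&\bigl(\mathit{GetTerms}((), f) \cap {} \\
&\quad [\mathit{GetTerms}((1), *) \cup \mathit{GetTerms}((1), a)] \cap {} \\
&\quad [\mathit{GetTerms}((2), *) \cup {} \\
&\qquad (\mathit{GetTerms}((2), g) \cap {} \\
&\qquad\quad [\mathit{GetTerms}((2,1), *) \cup \mathit{GetTerms}((2,1), b)] \cap {} \\
&\qquad\quad [\mathit{GetTerms}((2,3), *) \cup \mathit{GetTerms}((2,3), h)])]\bigr)
\end{aligned}
\tag{2}
$$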

⁴Occurrences of Terms are effectively ignored, since the retrieval formula Terms ∩ X can be simplified
to X.
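The coordinate-indexing retrieval formulas of this section can be sketched in a few lines. This is only an illustration under stated assumptions: terms are tuples `(symbol, arg1, ..., argn)`, variables are strings starting with `?`, every function symbol has a fixed arity, and the class and method names (`CoordinateIndex`, `get_instances`, `get_unifiables`) are hypothetical. The GetTerms sets are held in a dictionary keyed by (coordinate, symbol).

```python
from collections import defaultdict

def is_var(t):
    return isinstance(t, str) and t.startswith("?")

class CoordinateIndex:
    def __init__(self):
        self.terms = set()                 # Terms
        self.sets = defaultdict(set)       # (coordinate, symbol) -> GetTerms set

    def insert(self, term):
        self.terms.add(term)
        self._add(term, term, ())

    def _add(self, whole, sub, p):
        if isinstance(sub, tuple):
            self.sets[p, sub[0]].add(whole)
            for i, arg in enumerate(sub[1:], start=1):
                self._add(whole, arg, p + (i,))
        else:
            self.sets[p, "*" if is_var(sub) else sub].add(whole)

    def get_terms(self, p, s):
        return self.sets.get((p, s), set())

    def get_instances(self, t, p=()):
        if is_var(t):
            return self.terms
        if not isinstance(t, tuple):
            return self.get_terms(p, t)
        result = self.get_terms(p, t[0])
        for i, arg in enumerate(t[1:], start=1):
            result = result & self.get_instances(arg, p + (i,))
        return result

    def get_unifiables(self, t, p=()):
        if is_var(t):
            return self.terms
        stars = self.get_terms(p, "*")
        if not isinstance(t, tuple):
            return stars | self.get_terms(p, t)
        result = self.get_terms(p, t[0])
        for i, arg in enumerate(t[1:], start=1):
            result = result & self.get_unifiables(arg, p + (i,))
        return stars | result

index = CoordinateIndex()
t1 = ("f", "a", ("g", "b", "c", ("h", "d", "e")))
t2 = ("f", "?u", ("g", "b", "?v", "?w"))
t3 = ("f", "b", "?u")
for stored in (t1, t2, t3):
    index.insert(stored)

query = ("f", "a", ("g", "b", "?x", ("h", "?y", "?z")))
print(index.get_instances(query))    # only t1 is an instance of the query
print(index.get_unifiables(query))   # t1 and t2 unify with the query
```

Intersecting with `self.terms` where the formulas yield Terms mirrors the simplification Terms ∩ X = X; a tuned implementation would skip those intersections entirely.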
3 Path Indexing


While coordinate indexing certainly yields the correct result, each GetTerms set may contain
many irrelevant terms. In particular, GetTerms((), f) is the set of all the terms whose top
function symbol is f, regardless of the arguments, and GetTerms((1), a) is the set of
all terms whose first argument is a, regardless of the top function symbol.

Coordinate indexing describes terms by a mapping from coordinates to symbols. Path
indexing describes terms by a mapping from paths to symbols. The term t = f(a, g(b, x, h(y, z)))
can be specified by a mapping Symbol_t from the paths (), (f,1), (f,2), (f,2,g,1), (f,2,g,2),
(f,2,g,3), (f,2,g,3,h,1), (f,2,g,3,h,2) from the root node to the nodes of the tree in
Figure 1, with function symbols included, to the symbols in t (variables are written as *;
see footnote 5):

$$
\begin{aligned}
\mathit{Symbol}_t(()) &= f \\
\mathit{Symbol}_t((f,1)) &= a \\
\mathit{Symbol}_t((f,2)) &= g \\
\mathit{Symbol}_t((f,2,g,1)) &= b \\
\mathit{Symbol}_t((f,2,g,2)) &= * \\
\mathit{Symbol}_t((f,2,g,3)) &= h \\
\mathit{Symbol}_t((f,2,g,3,h,1)) &= * \\
\mathit{Symbol}_t((f,2,g,3,h,2)) &= *
\end{aligned}
$$

Symbol_t((f,2,g,1)) = b should be interpreted as saying that t has the symbol b as the
first argument of g, which is the second argument of the top function symbol f. Thus,
GetTerms((f,2,g,1), b) in the path-indexing method has the same value as GetTerms((), f) ∩
GetTerms((2), g) ∩ GetTerms((2,1), b) in the coordinate-indexing method.

Let Terms be the set of terms stored for possible retrieval. Then the following can
be used to define the path-indexing method for retrieving those terms in Terms that are
variants of, instances of, generalizations of, or unifiable with a term. If t and Terms are
all linear terms, i.e., do not contain repeated variables, then these retrieval operations are
exact. If there are nonlinear terms, extra terms may be retrieved. The expression p · f · i
denotes the path p extended by argument position i of function f, e.g., (f,2) · g · 3
is (f,2,g,3). The cases of terms with all variable arguments f(x_1, ..., x_n) and not all
variable arguments f(t_1, ..., t_n), where at least one t_i is assumed not to be a variable, are
distinguished in GetInstances and GetUnifiables.

⁵These indexing methods ignore the identity of variables, so all are replaced by *.

Let GetTerms(p, s) = {t ∈ Terms | Symbol_t(p) = s}. Retrieval formulas will be
composed of set union and intersection operations applied to GetTerms sets. Methods for
efficiently computing the GetTerms sets and their unions and intersections are described
in Section 4.


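$$
\begin{aligned}
\mathit{GetVariants}(p, x) &= \mathit{GetTerms}(p, *) \\
\mathit{GetVariants}(p, a) &= \mathit{GetTerms}(p, a) \\
\mathit{GetVariants}(p, f(t_1, \ldots, t_n)) &= \mathit{GetVariants}(p \cdot f \cdot 1, t_1) \cap \cdots \cap \mathit{GetVariants}(p \cdot f \cdot n, t_n)
\end{aligned}
$$

$$
\begin{aligned}
\mathit{GetInstances}(p, x) &= \mathit{Terms} \quad \text{(see footnote 6)} \\
\mathit{GetInstances}(p, a) &= \mathit{GetTerms}(p, a) \\
\mathit{GetInstances}(p, f(x_1, \ldots, x_n)) &= \mathit{GetTerms}(p, f) \\
\mathit{GetInstances}(p, f(t_1, \ldots, t_n)) &= \mathit{GetInstances}(p \cdot f \cdot 1, t_1) \cap \cdots \cap \mathit{GetInstances}(p \cdot f \cdot n, t_n)
\end{aligned}
$$

$$
\begin{aligned}
\mathit{GetGeneralizations}(p, x) &= \mathit{GetTerms}(p, *) \\
\mathit{GetGeneralizations}(p, a) &= \mathit{GetTerms}(p, *) \cup \mathit{GetTerms}(p, a) \\
\mathit{GetGeneralizations}(p, f(t_1, \ldots, t_n)) &= \mathit{GetTerms}(p, *) \cup {} \\
&\qquad (\mathit{GetGeneralizations}(p \cdot f \cdot 1, t_1) \cap \cdots \cap \mathit{GetGeneralizations}(p \cdot f \cdot n, t_n))
\end{aligned}
$$

$$
\begin{aligned}
\mathit{GetUnifiables}(p, x) &= \mathit{Terms} \\
\mathit{GetUnifiables}(p, a) &= \mathit{GetTerms}(p, *) \cup \mathit{GetTerms}(p, a) \\
\mathit{GetUnifiables}(p, f(x_1, \ldots, x_n)) &= \mathit{GetTerms}(p, *) \cup \mathit{GetTerms}(p, f) \\
\mathit{GetUnifiables}(p, f(t_1, \ldots, t_n)) &= \mathit{GetTerms}(p, *) \cup {} \\
&\qquad (\mathit{GetUnifiables}(p \cdot f \cdot 1, t_1) \cap \cdots \cap \mathit{GetUnifiables}(p \cdot f \cdot n, t_n))
\end{aligned}
$$

⁶Occurrences of Terms are effectively ignored, since the retrieval formula Terms ∩ X can be simplified
to X.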

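For example, instances of the term f(a, g(b, x, h(y, z))) can be retrieved by

$$
\begin{aligned}
&\mathit{GetTerms}((f,1), a) \cap {} \\
&\mathit{GetTerms}((f,2,g,1), b) \cap {} \\
&\mathit{GetTerms}((f,2,g,3), h)
\end{aligned}
\tag{3}
$$

and terms unifiable with it can be retrieved by

$$
\begin{aligned}
&\mathit{GetTerms}((), *) \cup {} \\
&\bigl([\mathit{GetTerms}((f,1), *) \cup \mathit{GetTerms}((f,1), a)] \cap {} \\
&\quad [\mathit{GetTerms}((f,2), *) \cup {} \\
&\qquad ([\mathit{GetTerms}((f,2,g,1), *) \cup \mathit{GetTerms}((f,2,g,1), b)] \cap {} \\
&\qquad\ [\mathit{GetTerms}((f,2,g,3), *) \cup \mathit{GetTerms}((f,2,g,3), h)])]\bigr)
\end{aligned}
\tag{4}
$$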
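The path-indexing formulas can be sketched in the same way as the coordinate-indexing ones; the only changes are that keys are paths (function symbols interleaved with argument positions) and that terms whose arguments are all variables are handled as a separate case. As before, the tuple representation, the `?` prefix for variables, and the names `PathIndex`, `get_instances`, and `get_unifiables` are assumptions of this illustration.

```python
from collections import defaultdict

def is_var(t):
    return isinstance(t, str) and t.startswith("?")

class PathIndex:
    def __init__(self):
        self.terms = set()                 # Terms
        self.sets = defaultdict(set)       # (path, symbol) -> GetTerms set

    def insert(self, term):
        self.terms.add(term)
        self._add(term, term, ())

    def _add(self, whole, sub, p):
        if isinstance(sub, tuple):
            f = sub[0]
            self.sets[p, f].add(whole)
            for i, arg in enumerate(sub[1:], start=1):
                self._add(whole, arg, p + (f, i))   # p . f . i
        else:
            self.sets[p, "*" if is_var(sub) else sub].add(whole)

    def get_terms(self, p, s):
        return self.sets.get((p, s), set())

    def get_instances(self, t, p=()):
        if is_var(t):
            return self.terms
        if not isinstance(t, tuple):
            return self.get_terms(p, t)
        f, args = t[0], t[1:]
        if all(is_var(a) for a in args):
            return self.get_terms(p, f)
        result = self.terms
        for i, arg in enumerate(args, start=1):
            result = result & self.get_instances(arg, p + (f, i))
        return result

    def get_unifiables(self, t, p=()):
        if is_var(t):
            return self.terms
        stars = self.get_terms(p, "*")
        if not isinstance(t, tuple):
            return stars | self.get_terms(p, t)
        f, args = t[0], t[1:]
        if all(is_var(a) for a in args):
            return stars | self.get_terms(p, f)
        result = self.terms
        for i, arg in enumerate(args, start=1):
            result = result & self.get_unifiables(arg, p + (f, i))
        return stars | result

index = PathIndex()
t1 = ("f", "a", ("g", "b", "c", ("h", "d", "e")))
t2 = ("f", "?u", ("g", "b", "?v", "?w"))
t3 = ("f", "b", "?u")
for stored in (t1, t2, t3):
    index.insert(stored)

query = ("f", "a", ("g", "b", "?x", ("h", "?y", "?z")))
print(index.get_instances(query))
print(index.get_unifiables(query))
```

For this query, the instance retrieval touches exactly the three GetTerms sets of Eq. (3). Starting each intersection from `self.terms` is a simplification for readability; it stands in for dropping Terms from the formula.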


4  Implementation 

An efficient implementation of coordinate or path indexing depends on (1) efficiently
computing GetTerms sets and (2) efficiently computing unions and intersections of GetTerms
sets.

4.1  Computing  GetTerms  Sets 

For the fastest retrieval, the set of terms in GetTerms(p, s) for coordinate or path p and
symbol s is explicitly stored. Thus, for each term in Terms (the set of all terms stored for
possible retrieval), a pointer to the term is stored in a GetTerms set for each symbol in the
term.

There is a large enough number of GetTerms sets to make it necessary to consider how
to find the GetTerms(p, s) sets efficiently given p and s.

One approach is to store GetTerms(p, s) in a hash table. For example, hash(p, s) could
compute an integer index n into array A such that

A(n) = {(p_i, s_i, GetTerms(p_i, s_i)) | hash(p_i, s_i) = n}.

GetTerms(p, s) could be found, if present, by comparing p and s with each p_i and s_i in
A(n).

An implementation should take account of the frequent occurrence of common initial
subsequences of coordinates or paths to reduce unnecessary computation and storage costs.
For example, computing hash((f,2,g,1), b) and hash((f,2,g,3), h) can share the cost of
computing a hash value for the common initial subsequence (f,2,g) of the two paths.
Likewise, the two paths can be stored in the hash table with structure sharing of the
common initial subsequence.

A second approach is to construct a discrimination net [2] (also see Section 6.1) for keys
(p, s) and to find GetTerms(p, s) by traversing the nodes of the discrimination net in accord
with (p, s).

Nodes in the discrimination net are reached from their parent nodes by integer and
symbol lookup operations:

ilp : node × integer → node
slp : node × symbol → node.

The discrimination-net node N corresponding to the end of the path p = (f,2,g,1), for
example, can be found by

N = ilp(slp(ilp(slp(N_0, f), 2), g), 1),

where N_0 is the top node of the discrimination net. The ilp integer lookup operation can
be implemented as an array reference operation. The slp symbol lookup operation can
be implemented as a hash table or association list lookup operation, or an array reference
operation on an integer corresponding to the symbol. Array references on integers
corresponding to the symbol provide fast constant-time symbol lookup, but can be wasteful of
space if the number of symbols is large and nodes often have few successors. Association
lists provide space-efficient nonconstant-time symbol lookup that is fast if the number of
successors is small. Hash tables provide nearly constant-time symbol lookup with space
requirements that may be between those of the other techniques.

The value of GetTerms(p, *) is stored in a field of node N; the value of GetTerms(p, s)
is stored in a field of node slp(N, s).

This approach eliminates the need to store p and s with GetTerms(p, s) as in the
hash table approach and eliminates the need to compare p and s with stored p_i's and s_i's.
The absence of this comparison may make the discrimination-net approach asymptotically
superior, although the hash-table approach may still perform well in practical cases.
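The discrimination-net approach to locating GetTerms sets can be sketched as follows. The node layout is an assumption of this example: Python dictionaries stand in for both the array used by ilp and the hash table or association list used by slp, and the field names `var_terms` and `terms` are hypothetical. As in the text, GetTerms(p, *) is kept in the node N at the end of path p, and GetTerms(p, s) in the node slp(N, s).

```python
class Node:
    def __init__(self):
        self.by_symbol = {}      # slp: symbol  -> Node
        self.by_index = {}       # ilp: integer -> Node (an array in the text)
        self.terms = set()       # GetTerms(p, s) when this node is slp(N, s)
        self.var_terms = set()   # GetTerms(p, *) when this node ends path p

def slp(node, symbol, create=False):
    if create:
        return node.by_symbol.setdefault(symbol, Node())
    return node.by_symbol.get(symbol)

def ilp(node, i, create=False):
    if create:
        return node.by_index.setdefault(i, Node())
    return node.by_index.get(i)

def insert(root, term):
    """Store a pointer to `term` in one GetTerms set per symbol occurrence."""
    def add(sub, node):                      # node = end of the current path p
        if isinstance(sub, tuple):
            fnode = slp(node, sub[0], create=True)
            fnode.terms.add(term)
            for i, arg in enumerate(sub[1:], start=1):
                add(arg, ilp(fnode, i, create=True))
        elif sub.startswith("?"):            # variable, indexed as *
            node.var_terms.add(term)
        else:                                # constant
            slp(node, sub, create=True).terms.add(term)
    add(term, root)

def get_terms(root, path, symbol):
    node = root
    for step in path:
        node = ilp(node, step) if isinstance(step, int) else slp(node, step)
        if node is None:                     # no stored term has this path
            return set()
    if symbol == "*":
        return node.var_terms
    target = slp(node, symbol)
    return target.terms if target is not None else set()

root = Node()
t1 = ("f", "a", ("g", "b", "?x", ("h", "?y", "?z")))
t2 = ("f", "a", ("g", "c", "?x", "?y"))
insert(root, t1)
insert(root, t2)
print(get_terms(root, ("f", 2, "g", 1), "b"))
```

A lookup that falls off the net returns the empty set immediately, which is the early-termination case mentioned later in the retrieval-time discussion.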

4.2  Computing  Unions  and  Intersections  of  GetTerms  Sets 

The set union and intersection operations specified in the retrieval formulas can always be
done in time proportional to the sum of the sizes of the GetTerms sets if each entry has
an extra mark field.

For example, Eq. (3) for retrieving instances of f(a, g(b, x, h(y, z))) by path indexing can
be computed by the algorithm

1. Mark with 1 every entry in GetTerms((f,1), a).

2. Mark with 2 every entry in GetTerms((f,2,g,1), b) that is now marked with 1.

3. Retrieve every entry in GetTerms((f,2,g,3), h) that is now marked with 2.

Equation (4) for retrieving terms unifiable with it can be computed by the algorithm

1. Mark with 1 every entry in GetTerms((f,1), a) or GetTerms((f,1), *).

2. Mark with 2 every entry in GetTerms((f,2,g,1), b) or GetTerms((f,2,g,1), *) that
is now marked with 1.

3. Retrieve every entry in GetTerms((f,2,g,3), h) or GetTerms((f,2,g,3), *) that is
now marked with 2.

4. Retrieve every entry in GetTerms((f,2), *) that is now marked with 1.

5. Retrieve every entry in GetTerms((), *).
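A minimal sketch of the marking technique for a flat intersection of unions, as in Eq. (3) and in steps 1 to 3 of the algorithm for Eq. (4). The `Entry` class, the reset of marks to 0 after each retrieval, and the function name `intersect_by_marking` are assumptions of this example; steps 4 and 5 of the second algorithm (the nested unions) are not covered.

```python
class Entry:
    def __init__(self, term):
        self.term = term
        self.mark = 0            # the extra mark field

def intersect_by_marking(steps):
    """`steps` is a list of steps; each step is a list of GetTerms sets
    (lists of Entry) whose union is taken.  Returns the entries that occur
    in every step.  Each entry is visited once per set that contains it."""
    touched, result = [], []
    last = len(steps)
    for level, union in enumerate(steps, start=1):
        for get_terms_set in union:
            for entry in get_terms_set:
                if entry.mark == level - 1:   # survived all earlier steps
                    entry.mark = level
                    touched.append(entry)
                    if level == last:
                        result.append(entry)
    for entry in touched:                     # leave marks clean for next query
        entry.mark = 0
    return result

e1, e2, e3 = Entry("t1"), Entry("t2"), Entry("t3")
set_a = [e1, e2]      # stands for GetTerms((f,1), a)
set_b = [e2, e3]      # stands for GetTerms((f,2,g,1), b)
set_h = [e2, e1]      # stands for GetTerms((f,2,g,3), h)
found = intersect_by_marking([[set_a], [set_b], [set_h]])
print([e.term for e in found])
```

Because an entry is promoted only when its mark equals the previous level, an entry that appears in two sets of the same union is still counted once.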
5 Comparison of Coordinate and Path Indexing


Coordinate  and  path  indexing  return  identical  results,  but  path  indexing  can  be  expected 
to  result  in  substantially  faster  retrieval  than  coordinate  indexing. 

The  price  for  this  increase  in  speed  is  a  modest  increase  in  storage  cost.  The  same 
number  of  pointers  to  terms  are  stored  in  both  methods,  but  they  are  divided  into  a  larger 
number  of  GetTerms  sets  in  path  indexing  than  in  coordinate  indexing.  The  extra  storage 
cost  comes  from  the  extra  indexing  structure  required  to  distinguish  among  all  the  paths  of 
the  terms  instead  of  all  the  coordinates  of  the  terms. 

Path  indexing  appears  to  be  more  feasible  for  storing  and  retrieving  unordered  collections 
of  terms. 

5.1  Retrieval  Time 

Inspection of the definitions of GetVariants, GetInstances, GetGeneralizations, and
GetUnifiables for coordinate and path indexing immediately reveals the reason for the
expected superiority of path indexing's retrieval time. Path indexing consistently performs
set union and intersection operations on sets of terms that can reasonably be expected to
be much smaller than those in coordinate indexing.

Consider Eq. (1) and Eq. (3), which describe the retrieval of instances of f(a, g(b, x, h(y, z)))
by the two methods. Path indexing computes the intersection of the sets

GetTerms((f,1), a)
GetTerms((f,2,g,1), b)
GetTerms((f,2,g,3), h)

whereas coordinate indexing computes the intersection of the probably larger, and certainly
not smaller, sets

GetTerms((1), a)
GetTerms((2,1), b)
GetTerms((2,3), h)

which must still be intersected with the sets

GetTerms((), f)
GetTerms((2), g)

Comparison of Eq. (2) and Eq. (4) for retrieving terms that are unifiable with f(a, g(b, x, h(y, z)))
also demonstrates the expected superiority of path indexing's retrieval time.

The following table shows the number of GetTerms in the retrieval formula for
coordinate indexing and path indexing. Only in extreme cases (no function symbols or no function
symbols with nonvariable arguments) does path indexing require as many GetTerms as
coordinate indexing, and it never requires more.


Number of GetTerms in Retrieval Formula

Retrieval Type     Coordinate Indexing     Path Indexing
Variant            V + C + F               V + C
Instance           C + F                   C + Fv
Generalization     V + 2C + 2F             V + 2C + F
Unifiable          2C + 2F                 2C + F + Fv

V  = number of variable-symbol occurrences in retrieval term
C  = number of constant-symbol occurrences in retrieval term
F  = number of function-symbol occurrences in retrieval term
Fv = number of function-symbol occurrences with only variable arguments in
     retrieval term
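As a worked check of the table, the counts V, C, F, and Fv can be computed for a retrieval term and substituted into the formulas. The term representation (tuples for composite terms, strings starting with `?` for variables) and the function names are assumptions of this sketch.

```python
def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def symbol_counts(term):
    """Return (V, C, F, Fv) for a retrieval term."""
    if is_var(term):
        return (1, 0, 0, 0)
    if not isinstance(term, tuple):
        return (0, 1, 0, 0)
    args = term[1:]
    v, c, f = 0, 0, 1
    fv = 1 if all(is_var(a) for a in args) else 0
    for arg in args:
        dv, dc, df, dfv = symbol_counts(arg)
        v, c, f, fv = v + dv, c + dc, f + df, fv + dfv
    return (v, c, f, fv)

def get_terms_counts(term):
    """Number of GetTerms in each retrieval formula: (coordinate, path)."""
    V, C, F, Fv = symbol_counts(term)
    return {
        "variant":        (V + C + F,         V + C),
        "instance":       (C + F,             C + Fv),
        "generalization": (V + 2 * C + 2 * F, V + 2 * C + F),
        "unifiable":      (2 * C + 2 * F,     2 * C + F + Fv),
    }

query = ("f", "a", ("g", "b", "?x", ("h", "?y", "?z")))
print(symbol_counts(query))        # (3, 2, 3, 1)
print(get_terms_counts(query))
```

For f(a, g(b, x, h(y, z))) this gives 5 versus 3 GetTerms for instance retrieval and 10 versus 8 for unifiable retrieval, matching the number of GetTerms that appear in Eqs. (1) through (4).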


The minimum number of symbol and integer lookups (slp and ilp operations) required
is the same for the two procedures:

• C + F symbol lookups.

• V + C + F - 1 integer lookups for variant and generalization retrievals.

• C + F - 1 integer lookups for instance and unifiable retrievals.

The preceding is actually a worst-case analysis. Retrieval of some GetTerms sets and
some symbol and integer lookup operations can be eliminated if, for example, a symbol
lookup has no resulting node. This can happen because the discrimination net may contain
only those nodes necessary to index the terms actually stored, not all those that might
be used in retrieval requests. The more detailed indexing by function symbol and
argument position of path indexing, as compared with indexing by argument position only in
coordinate indexing, makes this elimination of effort more likely in path indexing than in
coordinate indexing. This further enhances the superiority of path indexing.

The  smaller  size  of  the  GetTerms  sets  for  path  indexing  is  a  major  contributor  to 
the  method’s  superiority  over  coordinate  indexing.  However,  the  magnitude  of  the  size 
reduction  of  GetTerms  sets  depends  on  the  stored  terms  themselves,  so  we  cannot  present 
a  formal  comparison. 

5.2  Storage  Cost 

Storage  cost  is  modestly  greater  for  path  indexing  than  for  coordinate  indexing. 

We expect each GetTerms(p, s), the set of all terms with symbol s at coordinate or end
of path p, to be stored explicitly. For each symbol occurrence in a stored term t, a pointer
to t will be stored in one of these GetTerms sets. Thus, for both coordinate and path
indexing, the total number of pointers to stored terms, or the total size of the GetTerms
sets, is just the sum of the number of symbols of all the stored terms.

Although the same number of pointers to terms is stored in the GetTerms sets in both methods,
there are more sets for path indexing than for coordinate indexing. Thus, the number of
nodes in the discrimination net used to locate GetTerms sets is greater for path
indexing than coordinate indexing. The number of nodes is related to the number of possible
coordinates in coordinate indexing and to the number of possible paths in path indexing.

For example, consider the following rewrites which form a complete set of reductions for
free groups. When they are used to simplify terms, their left-hand sides must be indexed
for retrieval:

f(e, x) → x
f(x, e) → x
f(g(x), x) → e
f(x, g(x)) → e
g(e) → e
g(g(x)) → x
f(f(x, y), z) → f(x, f(y, z))
f(g(x), f(x, y)) → y
f(x, f(g(x), y)) → y
g(f(x, y)) → f(g(y), g(x))


The size of GetTerms sets for the stored left-hand sides is shown below.

Coordinate Indexing               Path Indexing
|GetTerms((), f)| = 7             |GetTerms((), f)| = 7
|GetTerms((), g)| = 3             |GetTerms((), g)| = 3
|GetTerms((1), *)| = 3            |GetTerms((f,1), *)| = 3
|GetTerms((1), e)| = 2            |GetTerms((f,1), e)| = 1
                                  |GetTerms((g,1), e)| = 1
|GetTerms((1), f)| = 2            |GetTerms((f,1), f)| = 1
                                  |GetTerms((g,1), f)| = 1
|GetTerms((1), g)| = 3            |GetTerms((f,1), g)| = 2
                                  |GetTerms((g,1), g)| = 1
|GetTerms((2), *)| = 3            |GetTerms((f,2), *)| = 3
|GetTerms((2), e)| = 1            |GetTerms((f,2), e)| = 1
|GetTerms((2), f)| = 2            |GetTerms((f,2), f)| = 2
|GetTerms((2), g)| = 1            |GetTerms((f,2), g)| = 1
|GetTerms((1,1), *)| = 5          |GetTerms((f,1,f,1), *)| = 1
                                  |GetTerms((f,1,g,1), *)| = 2
                                  |GetTerms((g,1,f,1), *)| = 1
                                  |GetTerms((g,1,g,1), *)| = 1
|GetTerms((1,2), *)| = 2          |GetTerms((f,1,f,2), *)| = 1
                                  |GetTerms((g,1,f,2), *)| = 1
|GetTerms((2,1), *)| = 2          |GetTerms((f,2,f,1), *)| = 1
                                  |GetTerms((f,2,g,1), *)| = 1
|GetTerms((2,1), g)| = 1          |GetTerms((f,2,f,1), g)| = 1
|GetTerms((2,2), *)| = 2          |GetTerms((f,2,f,2), *)| = 2
|GetTerms((2,1,1), *)| = 1        |GetTerms((f,2,f,1,g,1), *)| = 1

The indexing structure for these terms requires 24 nodes for coordinate indexing and 38
nodes for path indexing. For coordinate indexing, each of 16 GetTerms sets is stored in a
single node and 8 additional nodes represent coordinates: (), (1), (2), (1,1), (1,2), (2,1),
(2,2), (2,1,1). For path indexing, each of 24 GetTerms sets is stored in a single node and
14 additional nodes represent partial paths: (), (f,1), (f,2), (g,1), (f,1,f,1), (f,1,f,2),
(f,1,g,1), (f,2,f,1), (f,2,f,2), (f,2,g,1), (g,1,f,1), (g,1,f,2), (g,1,g,1), (f,2,f,1,g,1).

An optimization that reduces the length of paths and number of nodes for path indexing
follows from the observation that if g is a unary function, the path (g,1,f,1) could instead
be given as (g,f,1) because an argument of g is always argument number 1. In a further
optimization, 1 can always be omitted, so that (f,1,g,1) and (f,2,g,1), which denote the
first argument position of g in the first argument position of f and the first argument
position of g in the second argument position of f, would instead be written as (f,g) and
(f,2,g).
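The set and node counts for the free-group example can be reproduced with a short sketch. It indexes the ten left-hand sides counted in the table of GetTerms set sizes (seven with top symbol f and three with top symbol g) under both schemes and counts the GetTerms sets and the distinct coordinates or partial paths. The term encoding and the helper `build_index` are assumptions of this example, and the path optimizations just described are not applied.

```python
from collections import defaultdict

def build_index(terms, with_symbols):
    """Return {(key, symbol): set of terms}; keys are paths when
    `with_symbols` is true and plain coordinates otherwise."""
    sets = defaultdict(set)

    def add(whole, sub, p):
        if isinstance(sub, tuple):
            sets[p, sub[0]].add(whole)
            for i, arg in enumerate(sub[1:], start=1):
                step = (sub[0], i) if with_symbols else (i,)
                add(whole, arg, p + step)
        else:
            sets[p, "*" if sub.startswith("?") else sub].add(whole)

    for t in terms:
        add(t, t, ())
    return sets

def node_count(sets):
    # One node per GetTerms set plus one node per coordinate or partial path.
    return len(sets) + len({key for key, _ in sets})

LEFT_HAND_SIDES = [
    ("f", "e", "?x"),
    ("f", "?x", "e"),
    ("f", ("g", "?x"), "?x"),
    ("f", "?x", ("g", "?x")),
    ("g", "e"),
    ("g", ("g", "?x")),
    ("f", ("f", "?x", "?y"), "?z"),
    ("f", ("g", "?x"), ("f", "?x", "?y")),
    ("f", "?x", ("f", ("g", "?x"), "?y")),
    ("g", ("f", "?x", "?y")),
]

coordinate_sets = build_index(LEFT_HAND_SIDES, with_symbols=False)
path_sets = build_index(LEFT_HAND_SIDES, with_symbols=True)
print(len(coordinate_sets), node_count(coordinate_sets))   # 16 24
print(len(path_sets), node_count(path_sets))               # 24 38
```

Both indexes hold the same total number of pointers (one per symbol occurrence); only the number of sets and nodes differs.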


5.3  Application  to  Unordered  Terms 

Path  indexing’s  retrieval  of  smaller  sets  that  more  precisely  match  the  term  in  the  retrieval 
request  makes  it  more  feasible  to  change  the  definition  of  path  to  allow  retrieval  of  unordered 
collections  of  terms. 

Suppose, for example, that j is a commutative function. If the term t = j(a, b) is stored,
we would want to retrieve it if seeking variants of j(b, a) or instances of j(x, a). Although
in this case it would be easy to retrieve variants of both j(a, b) and j(b, a) or instances of
both j(a, x) and j(x, a), this would be more complex and costly in cases where the terms
have more than one function with unordered arguments.

Storage of j(a, b) would ordinarily be done in path indexing according to the relation

Symbol_t(()) = j
Symbol_t((j,1)) = a
Symbol_t((j,2)) = b

Thus, pointers to t would be stored in the sets GetTerms((), j), GetTerms((j,1), a),
and GetTerms((j,2), b).

In a refinement of path indexing for unordered terms, j(a, b) could be stored according
to the relation

Symbol_t(()) = j
Symbol_t((j,?)) = a
Symbol_t((j,?)) = b

where Symbol_t((j,?)) = a specifies that a occurs as an argument of top function symbol j.
Thus, pointers to t would be stored in the sets GetTerms((), j), GetTerms((j,?), a), and
GetTerms((j,?), b).

The GetVariants, GetInstances, GetGeneralizations, and GetUnifiables retrieval
formulas are likewise modified to substitute ? for argument indices in the case of unordered
arguments.

This is more feasible in path indexing than coordinate indexing. Doing likewise in
coordinate indexing would result in pointers to t being stored in GetTerms((), j), GetTerms((?), a),
and GetTerms((?), b). But sets like GetTerms((?), a) appear to be too undiscriminating
in their membership to be practically useful, since they contain all terms with unordered
argument a regardless of function symbol.
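A sketch of the refinement for unordered arguments, restricted to instance retrieval. The set `COMMUTATIVE` of unordered function symbols, the marker `ANY` standing for ?, and the class name are assumptions of this illustration. Since all arguments of an unordered function share one path step, the sets returned are candidate supersets (for example, a request for j(a, a) also retrieves j(a, b)), so a final check on each candidate may be needed.

```python
from collections import defaultdict

ANY = "?"                  # shared argument position for unordered functions
COMMUTATIVE = {"j"}        # assumption: symbols with unordered arguments

def is_var(t):
    return isinstance(t, str) and t.startswith("?") and t != ANY

def step(f, i):
    # Path extension p . f . i, with ? replacing i for unordered functions.
    return (f, ANY if f in COMMUTATIVE else i)

class UnorderedPathIndex:
    def __init__(self):
        self.terms = set()
        self.sets = defaultdict(set)

    def insert(self, term):
        self.terms.add(term)
        self._add(term, term, ())

    def _add(self, whole, sub, p):
        if isinstance(sub, tuple):
            self.sets[p, sub[0]].add(whole)
            for i, arg in enumerate(sub[1:], start=1):
                self._add(whole, arg, p + step(sub[0], i))
        else:
            self.sets[p, "*" if is_var(sub) else sub].add(whole)

    def get_terms(self, p, s):
        return self.sets.get((p, s), set())

    def get_instances(self, t, p=()):
        if is_var(t):
            return self.terms
        if not isinstance(t, tuple):
            return self.get_terms(p, t)
        f, args = t[0], t[1:]
        if all(is_var(a) for a in args):
            return self.get_terms(p, f)
        result = self.terms
        for i, arg in enumerate(args, start=1):
            result = result & self.get_instances(arg, p + step(f, i))
        return result

index = UnorderedPathIndex()
jab, jcd, kab = ("j", "a", "b"), ("j", "c", "d"), ("k", "a", "b")
for stored in (jab, jcd, kab):
    index.insert(stored)
print(index.get_instances(("j", "?x", "a")))   # finds j(a, b)
```

The ordered symbol k is indexed as in ordinary path indexing, so a request for k(b, a) does not retrieve k(a, b).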


6  Comparison  with  Other  Indexing  Methods 

6.1  Discrimination  Net 

In refining the coordinate-indexing method to obtain the path-indexing method, we made
retrieval more sensitive to context. We would retrieve GetTerms((f,1), a), the set of all
terms with a as the first argument of top function symbol f, instead of GetTerms((1), a),
the set of all terms with a as the first argument regardless of function symbol.

Retrieval can be made even more sensitive to context. To make retrieval sensitive to all
symbols to the left of the current one, the term t = f(a, g(b, x, h(y, z))) can be specified by
a mapping Symbol_t of the tree in Figure 1 to the symbols in t (variables are written as *;
see footnote 7):

$$
\begin{aligned}
\mathit{Symbol}_t(()) &= f \\
\mathit{Symbol}_t((f)) &= a \\
\mathit{Symbol}_t((f,a)) &= g \\
\mathit{Symbol}_t((f,a,g)) &= b \\
\mathit{Symbol}_t((f,a,g,b)) &= * \\
\mathit{Symbol}_t((f,a,g,b,*)) &= h \\
\mathit{Symbol}_t((f,a,g,b,*,h)) &= * \\
\mathit{Symbol}_t((f,a,g,b,*,h,*)) &= *
\end{aligned}
$$

This mapping supports the use of a discrimination net or trie [2,3,6]. A discrimination
net can be used to store the strings of symbols obtained by the preorder traversal of terms
to be indexed.

Nodes in the discrimination net are reached from their parent nodes by the symbol
lookup operation:⁸

slp : node × symbol → node.

The slp symbol lookup operation can be implemented in a variety of ways as discussed in
Section 4.1. Terms can be stored in the terms field of nodes. Let N_0 be the top node of
the discrimination net.

Let Terms be the set of terms stored for possible retrieval. For each term t in Terms
with preorder traversal (s_1, ..., s_n) (e.g., (f, a, g, b, *, h, *, *) for t = f(a, g(b, x, h(y, z)))),

t ∈ terms(slp(··· slp(N_0, s_1) ···, s_n)).

The following can be used to define methods for retrieving those terms in Terms that are
variants of, instances of, generalizations of, or unifiable with a term. GetVariants(N_0, (t))
returns a subset of Terms that includes all variants of t; GetInstances(N_0, (t)) returns
a subset that includes all instances; GetGeneralizations(N_0, (t)) returns a subset that
includes all generalizations; GetUnifiables(N_0, (t)) returns a subset that includes all terms
unifiable with t. If t and Terms are all linear terms, then these retrieval operations are
exact. If there are nonlinear terms, extra terms may be retrieved.

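$$
\begin{aligned}
GetVariants(N, \langle\rangle) &= terms(N) \\
GetVariants(N, \langle x, t_2, \ldots, t_m \rangle) &= GetVariants(slp(N, *), \langle t_2, \ldots, t_m \rangle) \\
GetVariants(N, \langle a, t_2, \ldots, t_m \rangle) &= GetVariants(slp(N, a), \langle t_2, \ldots, t_m \rangle) \\
GetVariants(N, \langle f(t'_1, \ldots, t'_n), t_2, \ldots, t_m \rangle) &= GetVariants(slp(N, f), \langle t'_1, \ldots, t'_n, t_2, \ldots, t_m \rangle)
\end{aligned}
$$

$$
\begin{aligned}
GetInstances(N, \langle\rangle) &= terms(N) \\
GetInstances(N, \langle *, t_2, \ldots, t_m \rangle) &= \bigcup_{M \in Skip(N)} GetInstances(M, \langle t_2, \ldots, t_m \rangle) \\
GetInstances(N, \langle a, t_2, \ldots, t_m \rangle) &= GetInstances(slp(N, a), \langle t_2, \ldots, t_m \rangle) \\
GetInstances(N, \langle f(t'_1, \ldots, t'_n), t_2, \ldots, t_m \rangle) &= GetInstances(slp(N, f), \langle t'_1, \ldots, t'_n, t_2, \ldots, t_m \rangle)
\end{aligned}
$$

$$
\begin{aligned}
GetGeneralizations(N, \langle\rangle) &= terms(N) \\
GetGeneralizations(N, \langle x, t_2, \ldots, t_m \rangle) &= GetGeneralizations(slp(N, *), \langle t_2, \ldots, t_m \rangle) \\
GetGeneralizations(N, \langle a, t_2, \ldots, t_m \rangle) &= GetGeneralizations(slp(N, *), \langle t_2, \ldots, t_m \rangle) \\
&\quad \cup\; GetGeneralizations(slp(N, a), \langle t_2, \ldots, t_m \rangle) \\
GetGeneralizations(N, \langle f(t'_1, \ldots, t'_n), t_2, \ldots, t_m \rangle) &= GetGeneralizations(slp(N, *), \langle t_2, \ldots, t_m \rangle) \\
&\quad \cup\; GetGeneralizations(slp(N, f), \langle t'_1, \ldots, t'_n, t_2, \ldots, t_m \rangle)
\end{aligned}
$$

$$
\begin{aligned}
GetUnifiables(N, \langle\rangle) &= terms(N) \\
GetUnifiables(N, \langle *, t_2, \ldots, t_m \rangle) &= \bigcup_{M \in Skip(N)} GetUnifiables(M, \langle t_2, \ldots, t_m \rangle) \\
GetUnifiables(N, \langle a, t_2, \ldots, t_m \rangle) &= GetUnifiables(slp(N, *), \langle t_2, \ldots, t_m \rangle) \\
&\quad \cup\; GetUnifiables(slp(N, a), \langle t_2, \ldots, t_m \rangle) \\
GetUnifiables(N, \langle f(t'_1, \ldots, t'_n), t_2, \ldots, t_m \rangle) &= GetUnifiables(slp(N, *), \langle t_2, \ldots, t_m \rangle) \\
&\quad \cup\; GetUnifiables(slp(N, f), \langle t'_1, \ldots, t'_n, t_2, \ldots, t_m \rangle)
\end{aligned}
$$

⁵ As in coordinate and path indexing, we ignore identity of variables, so all are replaced by $*$. It is feasible
instead to retain variable names and bind variables while traversing the discrimination net. The search for
terms can then be pruned when variable bindings conflict [6].

⁶ To simplify this description, we treat $*$ as any other symbol. To save lookup time, the successor node
for a variable should be stored in a separate field in the node.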

One auxiliary function is necessary. $Skip(N)$ is the set of nodes in the discrimination
net obtained by skipping over all the descendant nodes that correspond to skipping a single
term. For all symbols $s$ for which $slp(N, s)$ is defined, if $s$ has arity 0 ($s$ is a constant or $*$),
then $slp(N, s) \in Skip(N)$; if $s$ has arity $n > 0$, then $Skip^n(slp(N, s)) \subseteq Skip(N)$, where
$Skip^n$ denotes $Skip$ applied $n$ times, once for each argument of $s$.

The $Skip$ function can be implemented as described or, alternatively, the value of
$Skip(N)$ can be stored as an explicit list in node $N$. The latter should be much more
efficient, since it eliminates arity computations and traversing intermediate nodes. The
extra storage required for such lists is negligible. Although $Skip$ lists can be long, any node
can only be an element of a single $Skip$ list, so the cumulative cost of $Skip$ lists is one list
element per node.
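
The four retrieval operations and the recursive form of $Skip$ can be sketched together as follows. This is a hedged illustration under several assumptions that are not part of the method itself: compound terms are Python tuples, variables are the string `"*"`, child nodes are keyed by `(symbol, arity)` so that `skip` needs no separate arity table, and the helper names (`retrieve`, `key`, `args`) are invented for the example. `skip` is computed recursively here rather than stored as an explicit list.

```python
class Node:
    def __init__(self):
        self.children = {}  # (symbol, arity) -> Node
        self.terms = []     # terms stored at this node

VAR = ("*", 0)              # key under which stored variables are found

def key(term):
    """Lookup key for the head symbol of a term."""
    return (term[0], len(term) - 1) if isinstance(term, tuple) else (term, 0)

def args(term):
    return list(term[1:]) if isinstance(term, tuple) else []

def slp(node, k):
    return node.children.get(k)

def insert(root, term):
    node, todo = root, [term]
    while todo:                          # iterative preorder traversal
        t = todo.pop(0)
        node = node.children.setdefault(key(t), Node())
        todo = args(t) + todo
    node.terms.append(term)

def skip(node):
    """Nodes reached from node by skipping exactly one stored term."""
    result = []
    for (_sym, arity), child in node.children.items():
        frontier = [child]
        for _ in range(arity):           # skip each argument in turn
            frontier = [m for n in frontier for m in skip(n)]
        result.extend(frontier)
    return result

def retrieve(node, query, stored_var, query_var):
    """stored_var: a stored '*' may match a nonvariable query term.
    query_var: a query '*' may match any stored term (via skip)."""
    if node is None:
        return []
    if not query:
        return list(node.terms)
    t, rest = query[0], query[1:]
    if t == "*":
        if query_var:
            return [x for m in skip(node)
                    for x in retrieve(m, rest, stored_var, query_var)]
        return retrieve(slp(node, VAR), rest, stored_var, query_var)
    found = retrieve(slp(node, key(t)), args(t) + rest, stored_var, query_var)
    if stored_var:
        found = found + retrieve(slp(node, VAR), rest, stored_var, query_var)
    return found

def get_variants(n, t):        return retrieve(n, [t], False, False)
def get_instances(n, t):       return retrieve(n, [t], False, True)
def get_generalizations(n, t): return retrieve(n, [t], True, False)
def get_unifiables(n, t):      return retrieve(n, [t], True, True)

n0 = Node()
stored = [("f", "a", "b"), ("f", "*", "b"), ("f", "a", "*"),
          ("f", ("g", "a"), "b"), ("g", "a")]
for term in stored:
    insert(n0, term)

print(get_variants(n0, ("f", "*", "b")))
print(get_instances(n0, ("f", "*", "b")))
print(get_generalizations(n0, ("f", "a", "b")))
print(get_unifiables(n0, ("f", ("g", "*"), "*")))
```

All stored and query terms in this example are linear, so the results are exact; with nonlinear terms the same code could return extra terms, as noted above.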
Unlike coordinate and path indexing, discrimination-net indexing does not use $GetTerms$
sets and does not need to compute any intermediate results by set union and intersection
operations. Discrimination-net indexing retrieves all terms in $terms(N)$ for all nodes
$N$ in the final recursive calls of $GetVariants$, $GetInstances$, $GetGeneralizations$, and
$GetUnifiables$.

In  coordinate  and  path  indexing,  the  number  and  size  of  the  GetTerms  sets  is  usually 
the  dominant  factor  in  retrieval  time.  In  discrimination-net  indexing,  the  number  of  symbol 
lookup  operations  is  the  dominant  factor. 

For  variant  retrieval,  F  +  C  symbol  lookup  operations  must  be  performed,  where  F  is 
the  number  of  function  symbols  and  C  is  the  number  of  constant  symbols  in  the  retrieval 
term.  As  noted  before,  slp(*)  operations  for  variables  should  not  be  performed  by  real 
symbol  lookup  operations  and  are  therefore  not  counted  here. 

For generalization retrieval, the number of symbol lookup operations is only exponentially
bounded by the number of symbols in the retrieval term. For example, retrieval of
generalizations of the term $f(c_1, \ldots, c_n)$ with $n$-ary function symbol $f$ and constant
arguments $c_1, \ldots, c_n$ may require $2^n$ symbol lookup operations (fewer if successor nodes do not
always exist).

For unifiable and instance retrievals, the number of symbol lookup operations is no
longer  bounded  by  the  number  of  symbols  in  the  retrieval  term.  The  number  of  symbol 
lookup  operations  depends  in  part  on  the  size  of  the  sets  of  nodes  computed  by  Skip{N), 
which  depends  on  the  number  and  structure  of  the  stored  terms. 

A contrived example to demonstrate the possibility of poor behavior in discrimination-net
indexing is the storage of the addition table $plus(m, n, m + n)$ for $m$ and $n$ in the range
[0, 999]:

    plus(0, 0, 0)
    ...
    plus(0, 999, 999)
    plus(1, 0, 1)
    ...
    plus(999, 999, 1998)

and the retrieval of instances of $plus(x, y, 150)$. Path indexing immediately finds the
solutions in $GetTerms(\langle plus, 3 \rangle, 150)$ while discrimination-net indexing must traverse
essentially the entire discrimination net, whose size exceeds 1 million nodes. On the other hand,
discrimination-net indexing immediately finds instances of $plus(70, 80, z)$ while path indexing
intersects the 1,000-element $GetTerms(\langle plus, 1 \rangle, 70)$ and $GetTerms(\langle plus, 2 \rangle, 80)$ sets
in time proportional to the sum of the sizes of the sets.
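
The contrast in this example can be reproduced at a smaller scale. The sketch below is illustrative only: it uses the range [0, 29] instead of [0, 999], represents the discrimination net below the $plus$ node as nested dictionaries, and represents the $GetTerms$ sets as a dictionary keyed by (argument position, constant). These data structures and the names `net`, `path`, and `net_instances` are assumptions made for the example.

```python
from collections import defaultdict

R = 30                        # small stand-in for the range [0, 999]
net = {}                      # nested dicts below the plus node: net[m][n][m + n] = term
path = defaultdict(set)       # path[(position, constant)] plays the role of GetTerms

for m in range(R):
    for n in range(R):
        term = ("plus", m, n, m + n)
        net.setdefault(m, {}).setdefault(n, {})[m + n] = term
        for position, constant in enumerate(term[1:], start=1):
            path[(position, constant)].add(term)

def net_instances(query):
    """Instances of plus(q1, q2, q3) found in the net; None marks a variable.
    Returns the terms found and the number of successor nodes visited."""
    frontier, visited = [net], 0
    for q in query:
        successors = []
        for node in frontier:
            if q is None:                 # variable: every successor must be followed
                successors.extend(node.values())
            elif q in node:               # constant: one symbol lookup
                successors.append(node[q])
        visited += len(successors)
        frontier = successors
    return frontier, visited

# plus(x, y, 20): the net is traversed almost entirely; path indexing reads one set.
found, visited = net_instances((None, None, 20))
print(len(found), visited)                # 21 951
print(set(found) == path[(3, 20)])        # True

# plus(7, 8, z): the net follows a single branch; path indexing intersects two sets.
found, visited = net_instances((7, 8, None))
print(found, visited)                     # [('plus', 7, 8, 15)] 3
print(len(path[(1, 7)]), len(path[(2, 8)]), path[(1, 7)] & path[(2, 8)])
```

With the full range [0, 999] the same pattern holds, with the visited-node count for the first query growing to roughly the size of the net.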

For several years, one of our deduction systems [9] relied on the undocumented use of
discrimination-net indexing. We developed path indexing as an alternative to discrimination-net
indexing to overcome some limitations of discrimination-net indexing. We had been
dissatisfied with our handling of functions that are associative and/or commutative. A provision
for unordered arguments in path indexing partially addresses this concern (see Section 5.3).
Discrimination-net indexing can have variable storage requirements and retrieval
times that are sensitive to the encoding of the input. For example, storing father(john, bill)
and father(john, mary) requires fewer nodes than storing them with argument order reversed
because they have common initial instead of common tail subsequences in their preorder
traversals. Storage requirements and retrieval time in path indexing do not depend
on argument order. Storage requirements for discrimination-net indexing can sometimes be
excessive. Path indexing generally requires less space.
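
The effect of argument order on node count can be checked directly: a trie needs one node per distinct nonempty prefix of the stored symbol strings. The following sketch assumes that the preorder traversals are given as lists of symbols; `count_nodes` is a name chosen for this illustration.

```python
def count_nodes(strings):
    """Number of trie nodes (excluding the root) needed to store the symbol strings:
    one node per distinct nonempty prefix."""
    prefixes = set()
    for s in strings:
        for i in range(1, len(s) + 1):
            prefixes.add(tuple(s[:i]))
    return len(prefixes)

original = [["father", "john", "bill"], ["father", "john", "mary"]]
reversed_args = [["father", "bill", "john"], ["father", "mary", "john"]]

print(count_nodes(original))       # 4: father, john, bill, mary
print(count_nodes(reversed_args))  # 5: father, bill, john, mary, john
```

The difference is one node here, but it grows with the number of terms that share a tail rather than an initial subsequence.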

Nevertheless,  discrimination-net  indexing  remains  very  competitive.  Discrimination- 
net  indexing  performs  no  set  union  and  intersection  operations.  Variant  retrieval  is  very 
rapid.  Generalization  retrieval  also  often  performs  well,  despite  the  absence  of  a  polynomial 
bound  on  the  number  of  symbol  lookup  operations  based  on  the  size  of  the  retrieval  term. 
McCune  has  suggested  that  discrimination-net  indexing  be  used  specifically  for  retrieving 
generalizations  [6]  and  has  investigated  its  use  for  other  types  of  retrieval.  Although  the 
number  of  symbol  lookup  operations  is  not  bounded  by  the  size  of  the  retrieval  term  for 
instance  and  unifiable  retrievals,  instance  and  unifiable  retrieval  can  be  accelerated  by  using 
lists  of  pointers  to  successor  nodes  for  skipped  terms  in  the  discrimination  net  instead  of 
traversing  the  skipped  discrimination-net  nodes.  Retrieval  time  is  still  always  bounded  by 
the number of stored terms and the size of the discrimination net.

Christian’s  HIPER  (High  PErformance  Rewriting)  system  [3]  for  extended  Knuth- 
Bendix  completion  includes  very  efficient  code  for  discrimination-net  indexing.  Permutation 
of  arguments  in  the  retrieval  term  permits  use  of  permutative,  including  commutative,  but 
not associative, functions. Knuth-Bendix completion usually requires overwhelmingly more
generalization retrievals than instance retrievals, which diminishes the impact of possibly poor
behavior of discrimination nets for instance retrieval. Christian’s linear term representation is
particularly  efficient  for  use  in  discrimination-net  retrievals  because  it  includes  the  sequence 
of  symbols  in  the  preorder  traversal  of  the  term. 


6.2  Codeword  Indexing 

Another approach to indexing terms is codeword indexing [8,10]. We describe here the
superimposed codeword indexing scheme of Ramamohanarao and Shepherd [8]. A superimposed
codeword indexing scheme for a conventional relational database generates a descriptor for
a record by ORing codewords associated with key values.

Consider the example of an employee(Name, Position, Department, Salary) relation.
Each possible field value has an associated codeword, e.g.,

    john     00010 00000 10000        jane     10000 00000 00001
    clerk    01001 00000 00000        manager  00000 01000 01000
    admin    00010 00100 00000        sales    00100 00000 10000
    22000    00000 00010 00001        32000    01000 00000 01000



A descriptor for the record employee(john, clerk, admin, 22000) is computed by ORing
the codewords for john, clerk, admin, and 22000:

    john     00010 00000 10000
    clerk    01001 00000 00000
    admin    00010 00100 00000
    22000    00000 00010 00001
    --------------------------
             01011 00110 10001

Query descriptors, such as for employee(_, clerk, admin, _), are computed by ORing
the codewords for all the specified fields:

    clerk    01001 00000 00000
    admin    00010 00100 00000
    --------------------------
             01011 00100 00000


A record might satisfy a query if the result of ANDing the record's descriptor with the
query's descriptor is equal to the query's descriptor. False matches are possible because different terms may coincidentally
have similar descriptors. Their frequency depends in part on the number of fields, the
codeword length, and the distribution of 1-bits in the codewords.
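
The descriptor computation and the match test can be written out as follows, using the codewords from the example above. The function names and the use of Python integers as bit vectors are choices made for this sketch.

```python
CODEWORDS = {
    "john":  "00010 00000 10000", "jane":    "10000 00000 00001",
    "clerk": "01001 00000 00000", "manager": "00000 01000 01000",
    "admin": "00010 00100 00000", "sales":   "00100 00000 10000",
    "22000": "00000 00010 00001", "32000":   "01000 00000 01000",
}

def codeword(value):
    return int(CODEWORDS[value].replace(" ", ""), 2)

def descriptor(values):
    """OR together the codewords of the given field values."""
    d = 0
    for v in values:
        d |= codeword(v)
    return d

def might_match(record_descriptor, query_descriptor):
    """True if every 1-bit of the query descriptor is also set in the record descriptor."""
    return (record_descriptor & query_descriptor) == query_descriptor

john = descriptor(["john", "clerk", "admin", "22000"])
jane = descriptor(["jane", "manager", "sales", "32000"])
query = descriptor(["clerk", "admin"])      # employee(_, clerk, admin, _)

print(format(john, "015b"))                 # 010110011010001
print(format(query, "015b"))                # 010110010000000
print(might_match(john, query))             # True
print(might_match(jane, query))             # False
```

A `True` result only marks the record as a candidate; the record itself must still be compared with the query to rule out a false match.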

In principle, the retrieval procedure ANDs each record descriptor with the query
descriptor. Whenever the result is equal to the query descriptor, the record is a possible
match  for  the  query.  In  practice,  retrieval  can  be  made  more  efficient  by  using  a  bit-slice 
representation  of  all  the  descriptors  and  examining  only  those  bit-slices  corresponding  to 
1-bits  in  the  query  descriptor. 
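
One way to picture the bit-slice representation is as a transposed bit matrix: slice $j$ holds bit $j$ of every record descriptor. The sketch below is an assumption-laden illustration (Python integers as slices, hypothetical function names), not the storage layout of any particular system. The three record descriptors are computed from the employee codewords above; the third corresponds to a hypothetical record employee(jane, clerk, admin, 32000).

```python
WIDTH = 15

def bit_slices(descriptors):
    """slices[j] is an integer whose bit i is 1 iff descriptor i has bit j set."""
    slices = [0] * WIDTH
    for i, d in enumerate(descriptors):
        for j in range(WIDTH):
            if (d >> j) & 1:
                slices[j] |= 1 << i
    return slices

def candidates(slices, n_records, query):
    """Indices of records whose descriptors contain every 1-bit of the query."""
    result = (1 << n_records) - 1            # start with every record as a candidate
    for j in range(WIDTH):
        if (query >> j) & 1:                 # examine only slices for 1-bits in the query
            result &= slices[j]
    return [i for i in range(n_records) if (result >> i) & 1]

records = [
    int("010110011010001", 2),   # employee(john, clerk, admin, 22000)
    int("111000100011001", 2),   # employee(jane, manager, sales, 32000)
    int("110110010001001", 2),   # employee(jane, clerk, admin, 32000), hypothetical
]
query = int("010110010000000", 2)            # employee(_, clerk, admin, _)

slices = bit_slices(records)
print(candidates(slices, len(records), query))                    # [0, 2]
print([i for i, d in enumerate(records) if d & query == query])   # [0, 2]
```

Each AND on a slice processes one bit of every record at once, which is where the word-level parallelism comes from; the query here has five 1-bits, so only five of the fifteen slices are read.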

To handle field values that are first-order predicate calculus terms, it is necessary to
extend this standard superimposed codeword scheme.

Unlike  all  the  previously  discussed  forms  of  indexing,  which  can  index  on  every  symbol 
in  a  term,  it  is  necessary  in  codeword  indexing  to  specify  which  argument  positions  are 
indexed. 

For example, suppose terms such as p(f(a,X),d), p(X,d), p(f(a,b),X), and p(g(c,b),X)
are of interest and only the numbered positions of p(1(2,3),4) are indexed.

Using the codewords

    a    01001 00000 00000
    b    00000 01010 00000
    c    10000 00000 00010
    d    00010 00100 00000
    f    00010 00000 10000
    g    00000 00001 00001

the example terms have the following descriptors, obtained by ORing the codewords for the
field values and adding mask bits denoting fields that contain variables:

    p(f(a,X),d)    01011 00100 10000  0010
    p(X,d)         00010 00100 00000  1**0
    p(f(a,b),X)    01011 01010 10000  0001
    p(g(c,b),X)    10000 01011 00011  0001

The final four bits of each descriptor denote whether that field contains a variable. The
value '*' for a mask bit indicates that this bit position is irrelevant for retrieval purposes;
it appears in mask bits corresponding to positions 2 and 3 of the term 1(2,3) where the
term is a variable and fields 2 and 3 are absent.

Now suppose terms unifiable with p(f(a,X),Y) are to be retrieved. If the descriptor
bits are denoted by 1, 2, ..., 15 and the mask bits by m1, m2, m3, m4, then terms unifiable
with p(f(a,X),Y) will have codewords that satisfy the formula m1 OR ((4 AND 11) AND
(m2 OR (2 AND 5))). Thus, either (1) position 1 must be filled by a variable or (2) f (bits
4 and 11) must be present and either (2a) position 2 is a variable or (2b) a (bits 2 and
5) must be present. Such retrieval formulas are reminiscent of those used in coordinate
indexing, but are less exact because they do not compare symbols, only bits from their
descriptors, and ignore their locations in the record.
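
The retrieval formula can be evaluated against the four example descriptors as a check. In this sketch the descriptors and masks are kept as strings so that bits can be numbered 1 to 15 from the left as in the text; the function names are invented for the illustration, and a '*' mask bit is simply treated as not set.

```python
def bit(descriptor, k):
    """True if descriptor bit k (numbered 1..15 from the left) is set."""
    return descriptor[k - 1] == "1"

def may_unify_with_p_f_a_X_Y(descriptor, mask):
    """Evaluate m1 OR ((4 AND 11) AND (m2 OR (2 AND 5)))."""
    m1 = mask[0] == "1"
    m2 = mask[1] == "1"
    f_present = bit(descriptor, 4) and bit(descriptor, 11)
    a_present = bit(descriptor, 2) and bit(descriptor, 5)
    return m1 or (f_present and (m2 or a_present))

stored = {
    "p(f(a,X),d)": ("01011 00100 10000", "0010"),
    "p(X,d)":      ("00010 00100 00000", "1**0"),
    "p(f(a,b),X)": ("01011 01010 10000", "0001"),
    "p(g(c,b),X)": ("10000 01011 00011", "0001"),
}

for term, (descriptor, mask) in stored.items():
    print(term, may_unify_with_p_f_a_X_Y(descriptor.replace(" ", ""), mask))
# p(f(a,X),d) True
# p(X,d) True
# p(f(a,b),X) True
# p(g(c,b),X) False
```

The first three terms are indeed unifiable with p(f(a,X),Y), and p(g(c,b),X) is correctly rejected because the bits for f are absent and position 1 is not a variable.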

This superimposed codeword indexing scheme was devised for indexing large Prolog
databases.  Two  obvious  extensions  are  required  for  general  and  efficient  indexing  of  terms 
in  broader  contexts. 

The  first  extension  is  just  the  specification  of  retrieval  formulas  for  variant,  instance, 
and  generalization  terms  as  well  as  unifiable  terms. 

The simple ORing of codewords to form term descriptors creates a presumption that
where a value appears in a term does not matter. This is sometimes true, as in the employee
relation, where values of the Name, Position, Department, and Salary fields cannot appear
in any other field. However, it is often necessary to distinguish between terms such as
f(g(a,b),b) and f(g(b,a),a), which are assigned the same descriptor by the above procedure.
The second extension is to use different codewords for a symbol depending on the
position of its occurrence in the term, e.g., by rotating a symbol’s codeword some number
of bits depending on the position.
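
A small experiment shows the effect of position-dependent codewords on these two terms. The specific rule used here, rotating each symbol's 15-bit codeword left by its index in the preorder traversal, is an assumption chosen for the illustration; the text only requires that the rotation depend on the position. The codewords are those of the earlier example.

```python
WIDTH = 15
CODEWORDS = {
    "a": "010010000000000",
    "b": "000000101000000",
    "f": "000100000010000",
    "g": "000000000100001",
}

def rotate_left(word, k):
    """Rotate a WIDTH-bit word left by k bits."""
    k %= WIDTH
    return ((word << k) | (word >> (WIDTH - k))) & ((1 << WIDTH) - 1)

def preorder(term):
    """Preorder symbol list of a term given as nested tuples (functor, args...)."""
    if isinstance(term, tuple):
        symbols = [term[0]]
        for arg in term[1:]:
            symbols.extend(preorder(arg))
        return symbols
    return [term]

def plain_descriptor(term):
    """Simple ORing of codewords: position is ignored."""
    d = 0
    for symbol in preorder(term):
        d |= int(CODEWORDS[symbol], 2)
    return d

def positional_descriptor(term):
    """OR of codewords rotated by preorder position (assumed rotation rule)."""
    d = 0
    for position, symbol in enumerate(preorder(term)):
        d |= rotate_left(int(CODEWORDS[symbol], 2), position)
    return d

t1 = ("f", ("g", "a", "b"), "b")   # f(g(a,b),b)
t2 = ("f", ("g", "b", "a"), "a")   # f(g(b,a),a)

print(plain_descriptor(t1) == plain_descriptor(t2))            # True
print(positional_descriptor(t1) == positional_descriptor(t2))  # False
```

Rotation keeps the number of 1-bits per codeword unchanged, so the descriptor width does not need to grow, although different positions can still collide by coincidence.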

Codeword  indexing  has  several  advantages,  notably  simplicity,  low  storage  requirements, 
and  use  of  efficient  logical  operations  on  bits. 

However, it has several disadvantages. Perhaps most significantly, its time complexity
is $O(N)$, where $N$ is the number of stored terms, because the descriptor of each stored
term must be compared with the query. Performing this computation on bit-slices reduces
the amount of work (by performing operations only on bit-slices corresponding to 1-bits
in the query descriptor) and may employ parallelism to do the remaining work (logical
operations on bit-slices can be performed in computer-word-size chunks), but the method
remains $O(N)$.⁹

False matches can be returned if there are coincidentally similar descriptors. The other
methods are exact retrieval methods except in the case of nonlinear terms. They would
never retrieve a term with a constant or function symbol that failed to match a constant or
function symbol in the query.¹⁰

Codeword  indexing  requires  specification  of  which  argument  positions  are  indexed.  If 
terms  of  unrestricted  size  and  structure  are  stored,  codeword  indexing  will  still  be  able  to 
index  only  the  specified  argument  positions.  This  results  in  less  complete  indexing  than  the 
other  methods  that  can  index  every  symbol  in  the  term. 

⁹ Hierarchical codeword indexing may overcome the $O(N)$ time requirement in practice, but this involves
the greater complexity of multiple indexing, initially with much wider codewords.

¹⁰ One consequence of this feature is that a faster version of the unification algorithm that does not check
whether constants and functions are identical can be used to unify the query with retrieved terms.

7  Conclusion

We have presented the path-indexing method for indexing terms. It is a refinement of the
standard coordinate-indexing method that was proposed for the PLANNER AI programming
language and was used in LMA for implementing deduction systems. Path indexing
nearly always retrieves terms by computing unions and intersections of fewer, smaller sets of
terms than coordinate indexing. Path indexing requires somewhat more space than coordinate
indexing, but we expect that the substantially faster path indexing will nearly always
be useable whenever coordinate indexing is.

However, the superiority of path indexing over coordinate indexing does not demonstrate
that it should be the method of choice in general. We have described and partially
analyzed the behavior of two important alternative methods: discrimination-net indexing
and codeword indexing. Unfortunately, formal comparison of the methods will generally
fail to justify a preference for one method, since the worst-case assumptions employed by
most formal analyses are unrealistic, and average-case analyses are difficult to produce and
require definition of the average case.

Therefore,  optimal  selection  of  an  indexing  method  may  require  experiments  on  sets 
of terms that occur in practice, with the correct proportion of variant, instance, generalization,
and unifiable retrievals as well as addition and deletion operations. For example,
discrimination-net indexing might be preferable for some sets of terms if instance and unifiable
retrieval is much less frequent than variant and generalization retrieval. Indexing terms
in  more  than  one  way  (e.g.,  discrimination  net  for  variant  and  generalization  retrieval,  and 
path  indexing  for  instance  and  unifiable  retrieval)  may  be  a  useful  approach  to  getting  the 
best  performance. 

We  have  formulated  the  following  guidelines  for  selecting  a  method.  If  speed  is  the 
paramount concern, discrimination-net indexing is likely to be best, although there are
exceptions  to  this  rule.  If  storage  requirements  are  an  issue,  or  discrimination-net  indexing 
works  poorly  in  a  particular  case,  coordinate  or  path  indexing  can  be  used.  The  latter  offers 
much  greater  speed  at  the  cost  of  some  increase  in  space.  If  the  amount  of  storage  required 
must  be  minimized,  codeword  indexing  can  be  used.  The  bit  manipulation  operations  in 
codeword  indexing  are  individually  quite  fast,  but  the  method  has  the  disadvantages  that 
retrieval  time  is  linear  in  the  number  of  stored  terms,  retrieval  is  less  exact  than  for  the 
other  methods,  and  terms  cannot  be  indexed  arbitrarily  deeply. 